Logistic Regression and Support Vector Machines

This blog post provides a detailed explanation of two key concepts in machine learning: Logistic Regression and Support Vector Machines (SVMs). Both are fundamental algorithms for classification tasks.

1. Logistic Regression

Logistic Regression is a probabilistic classification model used primarily for binary classification problems. Unlike linear regression, which predicts continuous values, logistic regression outputs the probability that a given input belongs to a particular class (typically the positive class, labeled as y=1). This probability is constrained to the interval [0, 1].

1.1 From Probability to Odds to Log-Odds

Directly modeling probability p (where 0 ≤ p ≤ 1) is challenging because linear models can produce values outside this range. To address this, we transform the probability:

  • Odds: Defined as p / (1 – p), odds range from 0 to ∞.
  • Log-Odds (Logit): The natural logarithm of odds, log(p / (1 – p)), which ranges from -∞ to +∞. This unrestricted range allows us to model it as a linear function of the features.

Figure 1: Illustration of the transformation from probability to odds and log-odds.
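The probability-to-odds-to-log-odds chain can be sketched directly in plain Python (the function names `odds` and `logit` here are illustrative, not from any particular library):

```python
import math

def odds(p):
    """Convert a probability p in (0, 1) to odds in (0, inf)."""
    return p / (1 - p)

def logit(p):
    """Convert a probability to log-odds (the logit), in (-inf, inf)."""
    return math.log(odds(p))

# p = 0.5 gives even odds (1.0) and a logit of 0.
print(odds(0.5), logit(0.5))
# p = 0.8 gives roughly 4-to-1 odds.
print(odds(0.8))
```

Note how a probability above 0.5 maps to a positive logit and one below 0.5 to a negative logit, which is exactly the unrestricted range a linear model needs.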

1.2 The Logistic (Sigmoid) Function

By setting the log-odds equal to a linear combination of features, we derive the logistic function:

$$ p(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + w_0) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + w_0)}} $$

Here, \(\sigma(z)\) is the sigmoid function, which maps any real number z into the interval (0, 1), producing an S-shaped curve.

Figure 2: The sigmoid function mapping real values to probabilities in [0,1].
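A minimal sigmoid implementation in plain Python might look like the sketch below; splitting on the sign of z avoids overflow in `exp` for large negative inputs:

```python
import math

def sigmoid(z):
    """Map any real z to (0, 1); numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(sigmoid(0))     # 0.5, the midpoint of the S-curve
print(sigmoid(10))    # very close to 1
print(sigmoid(-10))   # very close to 0
```

The symmetry sigmoid(-z) = 1 - sigmoid(z) is what makes the same curve serve for both classes.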

Figure 3: Additional visualization of the sigmoid curve.

1.3 Handling Categorical Variables

Real-world datasets often include categorical features:

  • Ordinal variables (ordered categories, e.g., grades A-F): Map to ordered numbers (A=4, B=3, etc.).
  • Nominal variables (unordered, e.g., eye color): Use one-hot encoding to create binary vectors, avoiding implicit ordering.

Example: Eye colors {blue, green, brown} → blue: [1, 0, 0], green: [0, 1, 0], brown: [0, 0, 1]. (Often drop one category to avoid redundancy.)

Figure 4: Example of one-hot encoding for categorical variables.
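One-hot encoding as described above fits in a few lines; a real pipeline would typically use a library encoder, so this hand-rolled `one_hot` is purely illustrative:

```python
def one_hot(value, categories):
    """Encode a nominal value as a binary vector over the given categories."""
    return [1 if value == c else 0 for c in categories]

colors = ["blue", "green", "brown"]
print(one_hot("blue", colors))   # [1, 0, 0]
print(one_hot("green", colors))  # [0, 1, 0]
print(one_hot("brown", colors))  # [0, 0, 1]
```

Exactly one position is 1 in each vector, so no artificial ordering is imposed on the categories.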

1.4 Training Logistic Regression: Maximum Likelihood Estimation

Parameters \(\mathbf{w}\) and \(w_0\) are learned by maximizing the likelihood of observing the training data. The log-likelihood function is:

$$ L(\mathbf{w}) = \sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right] $$

where \( p_i = \sigma(\mathbf{w}^T \mathbf{x}_i + w_0) \).

This is equivalent to minimizing the negative log-likelihood (the cross-entropy loss). The log-likelihood is concave (equivalently, the cross-entropy loss is convex), so any local optimum is global; however, no closed-form solution exists, so the parameters are found with iterative methods such as gradient descent on the loss (or, equivalently, gradient ascent on the log-likelihood).
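A bare-bones gradient-descent version of this training loop can be sketched in plain Python on a toy 1-D dataset (the learning rate, epoch count, and data below are illustrative choices, not tuned values):

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logreg(X, y, lr=0.1, epochs=2000):
    """Fit w, w0 by gradient descent on the average negative log-likelihood."""
    n, d = len(X), len(X[0])
    w, w0 = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_w0 = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w0)
            err = p - yi  # derivative of cross-entropy w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_w0 += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        w0 -= lr * grad_w0 / n
    return w, w0

# Toy 1-D data: label 1 when x > 2, label 0 otherwise.
X = [[0.0], [1.0], [1.5], [2.5], [3.0], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, w0 = train_logreg(X, y)
print(sigmoid(w[0] * 0.0 + w0), sigmoid(w[0] * 4.0 + w0))
```

The convenient identity used here is that the gradient of the cross-entropy with respect to the logit is simply (p - y), which keeps the update rule short.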

2. Support Vector Machines (SVMs)

Support Vector Machines are discriminative classifiers that directly learn a decision boundary, outputting class labels (+1 or -1) without probabilities.

2.1 Linear SVM Decision Boundary

The decision rule is:

$$ \hat{y} = \operatorname{sign}(\mathbf{w}^T \mathbf{x} + b) $$

This defines a hyperplane \(\mathbf{w}^T \mathbf{x} + b = 0\).
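The decision rule translates directly into code; here is a minimal sketch (breaking the tie at a score of exactly zero toward +1 is an arbitrary convention, not part of the SVM definition):

```python
def svm_predict(w, b, x):
    """Linear SVM decision rule: the sign of w.x + b, with ties sent to +1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# With w = [1, -1] and b = 0.5, the point (2, 1) scores 1.5 and gets +1.
print(svm_predict([1.0, -1.0], 0.5, [2.0, 1.0]))
print(svm_predict([1.0, -1.0], 0.5, [0.0, 2.0]))
```

Unlike the logistic model, the output is a hard label; no probability is attached to the score.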

2.2 Choosing the Optimal Boundary: Maximum Margin

Multiple hyperplanes may separate the data perfectly. SVM selects the one maximizing the margin—the distance to the nearest points (support vectors)—for better generalization and robustness to noise.

Figure 5: Comparison of possible separating hyperplanes; the maximum margin one is most robust.

Figure 6: Illustration highlighting the maximum margin principle.

Figure 7: SVM hyperplane with maximum margin and highlighted support vectors.

Figure 8: Detailed view of the maximum margin hyperplane and support vectors.

2.3 Margin Calculation and Optimization

The margin boundaries are two parallel hyperplanes: \(\mathbf{w}^T \mathbf{x} + b = 1\) (positive side) and \(\mathbf{w}^T \mathbf{x} + b = -1\) (negative side).

The distance from a point \(\mathbf{x}\) to the hyperplane is \( \frac{|\mathbf{w}^T \mathbf{x} + b|}{||\mathbf{w}||} \). For the support vectors, this distance is \( \frac{1}{||\mathbf{w}||} \), so the total margin (the distance between the two margin hyperplanes) is \( \frac{2}{||\mathbf{w}||} \).

SVM maximizes the margin by minimizing \( ||\mathbf{w}|| \) (or \( \frac{1}{2} ||\mathbf{w}||^2 \)) subject to:

$$ y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i $$

Figure 9: Diagram showing margin calculation and distance to hyperplane.

This constrained optimization yields a robust classifier focused on the most critical data points (support vectors).
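The margin arithmetic above can be checked numerically with a short sketch (the weight vector and point below are made-up values, chosen so the point lies exactly on a margin hyperplane):

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance |w.x + b| / ||w|| from point x to the hyperplane w.x + b = 0."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    norm = math.sqrt(sum(wj * wj for wj in w))
    return abs(score) / norm

# For a support vector, w.x + b = +/-1, so its distance is 1/||w||.
w, b = [3.0, 4.0], -5.0   # ||w|| = 5
x_sv = [1.2, 0.6]         # 3*1.2 + 4*0.6 - 5 = 1, so x_sv sits on the +1 margin
print(distance_to_hyperplane(w, b, x_sv))   # approximately 0.2 = 1/||w||
```

Shrinking \(||\mathbf{w}||\) therefore widens the margin, which is exactly why the optimization minimizes \( \frac{1}{2} ||\mathbf{w}||^2 \).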

Conclusion

Logistic Regression provides probabilistic outputs ideal for interpreting confidence, while SVMs excel in finding robust boundaries for high-dimensional data. Both are cornerstone algorithms in supervised learning.